In this part of the study, we will explore geocoded provider data with respect to socioeconomic and demographic factors such as population density, median income, median age, population with health insurance etc. We will study the distribution of:
In this section, we will:
# Import Libraries
from IPython.display import display
# Import arcgis
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap
# Import libraries for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np
# Import plotting libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Import library for time
import time
# Create a GIS connection
gis = GIS("https://datascienceqa.esri.com/portal", "portaladmin", "esri.agp", verify_cert=False)
# gis = GIS("https://datascienceqa.esri.com/portal", "portaladmin", "esri.agp")
Provider data was geocoded using the GeoAnalytics server. Let's get the geocoded provider data feature layer for exploration.
# Search the feature layer
search_result = gis.content.search('title: provider_data_geocoded_7_30', 'Feature Layer')
search_result
# Get the feature layer item
provider_data_item = search_result[0]
provider_data_item
# Check for layers inside the item
provider_data_item.layers
# Get the layer needed for analysis
provider_data_layer = provider_data_item.layers[0]
provider_data_layer
# Look at the first 5 fields and their data types
for f in provider_data_layer.properties.fields[:5]:
print(f['name'],' ',f['type'])
# Check number of records in `provider_data_layer`
provider_data_layer.query(return_count_only=True)
Let's do a random test to check accuracy of geocoded data. For a given state, we will look at and plot the data points based on address provided vs address geocoded.
# Create a spatially enabled dataframe for WY
%time wy_df = provider_data_layer.query(where="user_Region='WY'", as_df=True)
# %time wy_df = provider_data_layer.query(where="USER_Provider_Business_Practice_Location_Address_State_Name='WY'", as_df=True)
wy_df.shape
# Look at data points that are geocoded outside of Wyoming
len(wy_df[wy_df['region']!='Wyoming'])
# Check the accuracy of geocoding
round(100-((wy_df.shape[1]/wy_df.shape[0])*100),2)
# Create a map
map1 = gis.map('USA')
map1
# Plot data points for WY
wy_df.spatial.plot(map1)
map1.take_screenshot()
map1.remove_layers()
We have an accuracy of >99% with geocoding. From this map, we can see that out of 12770 points that have been geocoded only 3 are outside of WY. Similar analysis can be performed for other states to check for geocoding errors.
# Create a spatially enabled dataframe for AZ
%time az_df = provider_data_layer.query(where="user_Region='AZ'", as_df=True)
# %time az_df = provider_data_layer.query(where="USER_Provider_Business_Practice_Location_Address_State_Name='AZ'", as_df=True)
az_df.shape
# Look at data points that are geocoded outside of Arizona
len(az_df[az_df['Region']!='Arizona'])
# Create a spatially enabled dataframe for TX
%time tx_df = provider_data_layer.query(where="user_Region='TX'", as_df=True)
tx_df.shape
# Look at data points that are geocoded outside of Texas
len(tx_df[tx_df['Region']!='Texas'])
To visualize how healthcare providers are distributed, let's do a heatmap of providers in the United States.
map_usa = gis.map('USA')
map_usa
# Add provider data and create a heatmap
# usa_layer = FeatureLayer("https://datascienceqa.esri.com/server/rest/services/Hosted/provider_clean_data_geocoded_6_19/FeatureServer")
renderer = {"renderer": "autocast", #This tells python to use JS autocasting
"type": "heatmap",
"blurRadius":1, # changes the size of the clusters
"maxPixelIntensity":2,
"minPixelIntensity":0,
"field":None}
renderer["colorStops"] = [{"ratio":0,"color":[63, 40, 102, 0]},
{"ratio":0.25,"color":[167,97,170,179]},
{"ratio":0.50,"color":"#7b3ce9"},
{"ratio":0.75,"color":[222,102,0,179]},
{"ratio":1,"color":[244,204,0,179]}]
# {"ratio":1,"color":"#ffff00"}]
map_usa.add_layer(provider_data_layer,
{ "type": "FeatureLayer",
"renderer": renderer,
})
# Add Legend
map_usa.legend = True
# Remove layers
map_usa.remove_layers()
map_usa.take_screenshot(set_as_preview=True)
The map shows distribution of providers throughout the US. We can see large gaps in Nevada, Idaho, Utah, Wyoming and Texas. These and other states can be explored further for analysis.
Here, we will search for demographic data at the county level. To do this, we will:
# Search for Population data layer
popsearch_result = gis.content.search('title: 2018 USA Population Density')
popsearch_result
# Get Population Density
popdensity = popsearch_result[0]
popdensity
# Check first 5 layers in population Density
popdensity.layers[:5]
# Look at first few field names for county layer
county_layer = popdensity.layers[46]
print('FIELD NAME', '\t\t', 'FIELD ALIAS')
for field in county_layer.properties.fields[:10]:
print(field['name'], '\t\t', field['alias'])
# Get specific attributes for Counties
%time
county_layer = popdensity.layers[46]
county_df = pd.DataFrame()
offset = 0
while offset <= 3000:
chunk_df = county_layer.query(out_fields=['Shape','ST_ABBREV','NAME','ASIAN_CY','AMERIND_CY','AVGHHSZ_CY','AVGHINC_CY','BLACK_CY','EDUCBASECY','HISPPOP_CY',
'MEDAGE_CY','MINORITYCY','OTHRACE_CY','PCI_CY','POPDENS_CY','UNEMPRT_CY','WHITE_CY','SMCOLL_CY',
'ASSCDEG_CY','BACHDEG_CY','GRADDEG_CY','TOTPOP_CY'],return_all_records=False,result_offset=offset,result_record_count=750,as_df=True)
county_df = pd.concat([chunk_df, county_df], ignore_index=True)
offset += 750
county_df.shape
Here, we will search for health expenditure data at the county level. To do this, we will:
# Search for Population data layer
expsearch_result = gis.content.search('title: 2018 USA Health Insurance Spending')
expsearch_result
# Get Healthcare expenditure data
health_exp = expsearch_result[0]
health_exp
# Check first few layers in population Density
health_exp.layers[:5]
# Look at the fields and their data types
health_exp_county_layer = health_exp.layers[46]
print('FIELD NAME', '\t', 'FIELD ALIAS')
for f in health_exp_county_layer.properties.fields[:5]:
print(f['name'], '\t',f['alias'])
# # Get specific attributes for Counties
health_exp_county_layer = health_exp.layers[46]
health_exp_county_df = pd.DataFrame()
offset = 0
while offset <= 3000:
chunk_df = health_exp_county_layer.query(out_fields=['ST_ABBREV','NAME','X8001_A','X8002_A','X8013_A','X8018_A','X8019_A','X8024_A','X8032_A','X13002_A','X13004_A'], return_all_records=False,result_offset=offset,result_record_count=500,as_df=True)
health_exp_county_df = pd.concat([chunk_df, health_exp_county_df], ignore_index=True)
offset += 500
health_exp_county_df.shape
We will spatially join both the demographic and expenditure layer so as to combine data into a single layer.
# Merge demographic and health expenditure data at county level
county_healthexp_df = county_df.spatial.join(health_exp_county_df,how='left', op='within')
# Check geometry type
county_healthexp_df.spatial.geometry_type
# Check shape of data
county_healthexp_df.shape
# Write the data into a file geodatabase
county_healthexp_df.spatial.to_featureclass(r'C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_layer')
This feature class can now be published as a feature service to the portal using ArcGIS Pro. The feature service can be used for any further analysis. We will publish this layer as demographic_healthexp_layer.
# Random check data from combined layer
county_healthexp_df[(county_healthexp_df['ST_ABBREV_left']=='NM') & (county_healthexp_df['NAME_left']=='Chaves County')]
# Confirm results with County demographic data
county_df[(county_df['ST_ABBREV']=='NM') & (county_df['NAME']=='Chaves County')]
# Confirm results with health expenditure data
health_exp_county_df[(health_exp_county_df['ST_ABBREV']=='NM') & (health_exp_county_df['NAME']=='Chaves County')]
Now, we would like to get the provider count for each county. Provider count is a key variable and we can get the count in multiple ways:
- Method 1: We can query the provider data layer, store results in a spatially enabled dataframe and generate counts.
- Method 2: Use `AggregatePoints` tool from the GeoAnalytics Desktop Toolbox.
Method 1 fails due to the size and memory usage required for our data. This is where the power of GeoAnalytics Toolkit shines.
AggregatePoints tool helps us aggregate point data (providers) onto polygon data (counties). The output generated is a polygon layer with some aggregated statistics.
We will run AggregatePoints with provider data as our point layer and demographic_healthexp_layer as the polygon layer.
In simple words, this tool will grab all the providers in a particular county and give us a count of providers for that county along with other demographic and health expenditure data that was in the polygon layer for counties.
Learn more about GeoAnalytics Desktop.
import arcpy
# Run Aggregate Points tool
arcpy.gapro.AggregatePoints(r"C:\Users\mohi9282\Documents\ArcGIS\Projects\GWR test\npi_provider.gdb\allstates_provider",
r"C:\Users\mohi9282\Desktop\arcgis\aggregate_points.gdb\agg_points_NB", "POLYGON",
r"C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_layer",
'', None, None, None, None, None)
This aggregated data layer can now be published as a feature service to the portal and can be used for further analysis. We will publish this layer as demo_healthexp_allproviders.
Get the aggrageted data layer that includes provider count and other demographics data
# Search for the layer
allsearch_result = gis.content.search('title: demo_healthexp_allproviders')
allsearch_result
# Get the layer
allprovider = allsearch_result[0]
allprovider
allprovider.layers
# Look at the top 5 fields in the layer
allprovider_layer = allprovider.layers[0]
for f in allprovider_layer.properties.fields[:5]:
print(f['name'], '\t',f['alias'])
# Store data from layer as a spatially enabled dataframe
allprovider_df = allprovider_layer.query(as_df=True)
allprovider_df.shape
# Look at the first few rows of dataframe
allprovider_df.head()
We can see that the dataframe (allprovider_df) has 3139 rows and 41 columns.
Here we will:
# Check for null values
allprovider_df.isnull().sum()
# Check for duplicate columns
allprovider_df.columns
We can see that the OBJECTID, ST_ABBREV AND index columns have duplicates. We will drop these duplicate columns from our data.
# Drop duplicate columns
allprovider_df.drop(['objectid1','objectid','index_right','objectid_right','st_abbrev_right','name_right',
'SHAPE__Length', 'SHAPE__Area'], axis=1, inplace=True)
allprovider_df.columns
# Rename Columns
allprovider_df.rename(columns={'x8001_a':'avg_healthcare','x8002_a':'avg_healthinsurance','x8013_a':'avg_medicare',
'x8018_a':'avg_medicalcare','x8019_a':'avg_medicalsrvc','x8024_a':'avg_labtest',
'x8032_a':'avg_presdrug','x13002_a':'avg_personalinsurance','x13004_a':'avg_socsecurity',
'asian_cy':'asian_pop','amerind_cy':'amerind_pop','avghhsz_cy':'avg_hhsz',
'avghinc_cy':'avg_hhinc','black_cy':'black_pop','educbasecy':'edubase',
'hisppop_cy':'hisp_pop','medage_cy':'median_age','minoritycy':'minority_pop',
'othrace_cy':'otherace_pop','pci_cy':'percap_income','popdens_cy':'pop_density',
'unemprt_cy':'unemp_rate','white_cy':'white_pop','smcoll_cy':'some_college',
'asscdeg_cy':'asso_deg','bachdeg_cy':'bach_deg','graddeg_cy':'grad_deg','totpop_cy':'total_population',
'st_abbrev_left':'state','name_left':'county','count':'provider_count','objectid_left':'objectid'}, inplace=True)
allprovider_df.columns
allprovider_df.shape
allprovider_df.spatial.plot()
This map shows all countiues with demographic and health expenditure information for each county.
# Write the data into a file geodatabase
allprovider_df.spatial.to_featureclass(r'C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_clean_allproviders')
# Search for the data layer
allsearch_result = gis.content.search('title: demographic_healthexp_clean_allproviders')
allsearch_result
# Get the layer
allprovider = allsearch_result[1]
allprovider
allprovider_layer = allprovider.layers[0]
for f in allprovider_layer.properties.fields[:5]:
print(f['name'], '\t',f['alias'])
# Store data from layer as a spatially enabled dataframe
allprovider_df = allprovider_layer.query(as_df=True)
allprovider_df.shape
# Create a new column - People per Provider
allprovider_df['people_per_prov'] = allprovider_df['total_population']/allprovider_df['provider_count']
allprovider_map = gis.map('USA', 4)
allprovider_map
From this map, we can see that:
# Define Renderer
allProvTest = {"renderer": {
"type": "classBreaks",
"field":"people_per_prov",
"transparency":.5,
"minValue":1}}
# Define Manual Class breaks
allProvTest['renderer']["classBreakInfos"] = [{
"classMaxValue": 100.00,
"label": "0 - 100.00",
"description": "0 - 100.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [255,247,236,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 0.05
}
}
}, {
"classMaxValue": 200.00,
"label": "100.001 - 200.00",
"description": "100.001 - 200.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [253,220,174,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 0.05
}
}
}, {
"classMaxValue": 350.00,
"label": "200.001 - 350.00",
"description": "200.001 - 350.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [252,177,123,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 0.05
}
}
}, {
"classMaxValue": 500.00,
"label": "350.001 - 500.00",
"description": "350.001 - 500.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [241,109,75,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 0.05
}
}
}, {
"classMaxValue": 1100.00,
"label": "500.001 - 1100.00",
"description": "500.001 - 1100.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [200,28,18,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 0.05
}
}
}]
# Plot Map using defined Renderer
allprovider_map.remove_layers()
allprovider_df.spatial.plot(map_widget=allprovider_map, renderer=allProvTest['renderer'])
allprovider_map.legend=True
allprovider_map.take_screenshot()
An alternate way to create the map without define a renderer is to directly plot the spatial dataframe by using Esri's classification algorithm. Here we show the use of esriClassifyNaturalBreaks which breaks the column values naturally into different classes.
# allprovider_df.spatial.plot(map_widget=allprovider_map,
# renderer_type='c', # for class breaks renderer
# method='esriClassifyNaturalBreaks', # classification algorithm
# class_count=6, # choose the number of classes
# col='people_per_prov', # numeric column to classify
# cmap='OrRd', # color map to pick colors from for each class
# alpha=0.7 , # specify opacity
# linewidth = 0.05)
The second largest state in U.S. both by area and population, Texas baosts of 261,231.71 sq mi land area with a population of ~28.7 million. The population density is low with just 105.2 people per square mile and varies drastically between heavely populated urban areas to sparsely populated rural.
As seen in the previous notebook, Texas has the fourth largest number of healthcare providers in U.S. However, the state stands second highest on Health Resources and Services Administration's (HRSA) list of shortage areas. Texas also tops HRSA's list of medically underserved areas.
Let's start our journey with exploring Texas!
Get the Data Layer
# Store data from layer as a spatially enabled dataframe
txprovider_df = allprovider_layer.query(where="state='TX'", out_fields=['objectid','state','county','avg_hhinc',
'median_age','total_population','provider_count'],as_df=True)
txprovider_df.shape
Let's explore distribution of providers with respect to population density.
txprovider_df['people_per_prov'] = txprovider_df['total_population']/txprovider_df['provider_count']
tx_pop_map = gis.map('Texas, USA', zoomlevel=4)
tx_pop_map
From this map, we can see that
- Irion County (in red) seems to be the worst with 799 people per healthcare provider.
- Borden, Glasscock, San Jacinto, Newton are some counties (dark rust) with 500-700 people per healthcare provider.
- There are very few counties (in white) in Texas with less than 100 people per healthcare provider.
# Define Renderer
txpopTest = {"renderer": { #This tells python to use JS autocasting
"type": "classBreaks",
"field":"people_per_prov",
"transparency":.5,
"minValue":1}}
# Define Manual Class breaks
txpopTest['renderer']["classBreakInfos"] = [{
"classMaxValue": 100.00,
"label": "0 - 100.00",
"description": "0 - 100.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [255,247,236,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 2
}
}
}, {
"classMaxValue": 300.00,
"label": "100.001 - 300.00",
"description": "100.001 - 300.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [253,220,174,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 2
}
}
}, {
"classMaxValue": 500.00,
"label": "300.001 - 500.00",
"description": "300.001 - 500.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [252,177,123,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 2
}
}
}, {
"classMaxValue": 700.00,
"label": "500.001 - 700.00",
"description": "500.001 - 700.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [241,109,75,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 2
}
}
}, {
"classMaxValue": 800.00,
"label": "700.001 - 800.00",
"description": "700.001 - 800.00",
"symbol": {
"type": "esriSFS",
"style": "esriSFSSolid",
"color": [200,28,18,178.5],
"outline": {
"style": "esriSLSSolid",
"type": "esriSLS",
"color": [128,128,128,255],
"width": 2
}
}
}]
# Plot Map using defined Renderer
tx_pop_map.remove_layers()
txprovider_df.spatial.plot(map_widget=tx_pop_map, renderer=txpopTest['renderer'])
#### FOR AUTOMATIC RENDERING - manual class breaks are not used
# txprovider_df.spatial.plot(map_widget=tx_pop_map,
# renderer_type='c', # for class breaks renderer
# method='esriClassifyNaturalBreaks', # classification algorithm
# class_count=8, # choose the number of classes
# col='people_per_prov', # numeric column to classify
# cmap='OrRd', # color map to pick colors from for each class
# alpha=0.7 , # specify opacity
# )
tx_pop_map.legend=True
tx_pop_map.take_screenshot()
tx_pop_map.remove_layers()
Let's create a density plot to explore providers with respect to avg. household income.
g = sns.jointplot(x='people_per_prov', y='avg_hhinc', data=txprovider_df,xlim=[0,400], ylim=[35000,100000] ,kind='kde', n_levels=8)
g.fig.suptitle('Kernel Density Plot of People per Provider vs Avg. Household Income', y=1.05)
From this plot we can see that for majority of the counties in Texas the average number of people per healthcare provider vary from 100-200 and the average household income for these counties vary from 60,000 - 70,000 dollars.
Let's create a density plot to explore providers with respect to median age.
g = sns.jointplot(y='people_per_prov', x='median_age', data=txprovider_df, ylim=[0,400], xlim=[20,60] ,kind='kde')
g.fig.suptitle('Kernel Density Plot of People per Provider vs Median Age', y=1.05)
From this plot we can see that for majority of the counties in Texas the median age is between 35-45. The average number of people per healthcare provider vary from 60 - 210 for these counties.
To summarize, we saw that: